Member 1: Daniel Rodriguez-Gonzalez
Member 2: Darrel Pyle
Member 3: Josh Ruiz
We will use the grouplens.org MovieLens dataset in conjunction with data from the Internet Movie Database (IMDb) to build a recommendation system that combines actors, movie certification, and IMDb ratings with user ratings from MovieLens. We will test whether the additional IMDb data improves the recommender's performance.
We will also test a recommendation system that recommends similar actors based on their performances in particular movie genres. The target in this case is still the users' ratings.
The performance of each model will be evaluated via a precision-recall curve; the model that scores higher on both measures is preferred. An 80/20 train/test split is used to train and test the models and to generate the precision-recall curves.
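For reference, the precision and recall behind these curves can be sketched for a single user's top-k list. This is a minimal illustration of the two measures, not GraphLab's implementation:

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user's top-k recommendation list."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / float(k)
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall

# 2 of the top-4 recommended titles are in the user's relevant set
p, r = precision_recall_at_k(['A', 'B', 'C', 'D'], ['B', 'D', 'E'], k=4)
# p = 2/4, r = 2/3
```

Averaging these values over all test users at each k produces the precision-recall curves used throughout this report.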
In this case, we wish to build a model that can be used by a streaming service (Netflix, AppleTV, Amazon Prime) to recommend movies to users based on their ratings of other movies.
A company would initially run the model with its existing user rating data and present recommended movies to a user. As users enter more ratings, the model can be refreshed so that the company offers better recommendations to its users.
Gathering Twitter streams for sentiment analysis on movies could also provide a measure of each movie's success. In this analysis we incorporated side data for the items; we would also like to incorporate side data on the users, such as age, gender, geographic location, and socio-economic status.
In this case, a casting agent would have a particular actor in mind for a role but would also like recommendations of actors with similar ratings in the genre space of the movie being made. This could provide an unbiased approach to selecting actors based on their past performances as judged by user ratings.
This could be a tool that studios provide to their casting managers, delivered as an app that is updated as more user ratings become available and more movies are made.
Gross ticket sales by actor could help in determining which actor can generate the most revenue for a movie. Actor salaries would also be beneficial in balancing movie budgets. A recommendation system could be developed to help structure a movie financially.
It was surprising that the best model was an item_similarity model, despite the extra data we downloaded for a user-item model. The additional side data for the items (movies) did improve the model over the one without it, but the item-item models always scored higher without the side data.
In some cases the recommendations make a lot of sense; for example, Bruce Willis is paired with Jason Statham. However, the models overall do not show high precision or recall. We believe this model can be improved with more data and analysis.
# Import libraries needed for processing and visualizations
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
From the README.txt file included with the MovieLens dataset:
This dataset describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 105339 ratings and 6138 tag applications across 10329 movies. These data were created by 668 users between April 03, 1996 and January 09, 2016. This dataset was generated on January 11, 2016.
Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.
The data are contained in four files, links.csv, movies.csv, ratings.csv and tags.csv.
#Importing dataset from MovieLens
df_links = pd.read_csv('links.csv')
df_tags = pd.read_csv('tags.csv')
df_ratings = pd.read_csv('ratings.csv')
df_movies = pd.read_csv('movies.csv')
The links file contains 10,329 rows and 3 columns. The movieId is a unique identifier for movies, the imdbId identifies movies on the IMDb website, and the tmdbId column identifies movies on the themoviedb.org site.
Each row in the links file is a unique movie.
print (df_links.shape)
print (df_links.movieId.unique().shape)
df_links.head()
The movies file also contains 10,329 unique rows, each identifying a movie. The other columns detail the title of the movie and the genres for the movie.
print (df_movies.shape)
print (df_movies.movieId.unique().shape)
df_movies.head()
The ratings file contains 4 columns and 105,339 entries.
Only 10,325 unique movies are represented in the dataset, rated by 668 unique users.
print (df_ratings.shape)
print (df_ratings.movieId.unique().shape)
print (df_ratings.userId.unique().shape)
df_ratings.head()
As the histogram below of all ratings in ratings.csv shows, ratings are entered in increments of 0.5 and the distribution is left-skewed.
df_ratings.rating.hist(bins = 40)
plt.title('Ratings Histogram')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.show()
Also, the ratings file records ratings assigned to movies by users, as can be seen below when we group by userId and movieId.
print (df_ratings.groupby(['userId','movieId'])['rating'].count().unique())
df_ratings.groupby(['userId','movieId']).count().head(10)
The tags file associates a textual tag that a user assigned with a particular movie.
There are 6,138 tags in the dataframe.
print (df_tags.shape)
df_tags.head()
If we group by userId, movieId, and tag, we can see that users have tagged several movies and some movies are tagged more than once.
df_tags.groupby(['userId','movieId','tag']).count().head(10)
#Install imdbpie via pip package manager
#!pip install imdbpie
from imdbpie import Imdb
# Create an instance with caching enabled; pass anonymize=True to proxy requests.
# Note that cached responses expire every 2 hours or so
# (the API response itself dictates the expiry time).
imdb = Imdb(cache=True)
If we perform a left join between the links and movies files on movieId, we can associate each movie with its imdbId and movie title, as shown below.
df_Link_Movie_join = pd.merge(df_links, df_movies, how='left',on='movieId')
print (df_Link_Movie_join.shape)
df_Link_Movie_join.head()
Furthermore, if we perform another left join between our ratings file and the joined links & movies file from above, we get a dataframe that incorporates userId, movieId, rating, imdbId, title, and genres for all movies that are rated by a user.
df_Ratings_Link_Movies_join = pd.merge(df_ratings, df_Link_Movie_join, how='left',on='movieId')
print (df_Ratings_Link_Movies_join.shape)
df_Ratings_Link_Movies_join.head()
To better understand the IMDb API, we will query a movie by its imdbId.
In this case, we take the imdbId of Casino (1995) to verify how the imdbId works and to observe what information can be pulled from the API.
Note that the imdbId required zero padding and the 'tt' prefix to be a valid query.
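The padding rule can be captured in a small helper (the function name is ours):

```python
def to_imdb_ttid(imdb_id):
    """Format a numeric MovieLens imdbId as a valid IMDb title id:
    zero-pad to 7 digits and add the 'tt' prefix."""
    return "tt" + str(int(imdb_id)).zfill(7)

# Casino (1995) appears in links.csv with imdbId 112641
to_imdb_ttid(112641)  # 'tt0112641'
```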
title = imdb.get_title_by_id("tt0112641")
print ('Title:',title.title)
print ('Rating:',title.rating)
print ('Certification:',title.certification)
print ('First Entry in Cast List:',title.cast_summary[0].name)
print ('First Entry in Cast List ID:',title.cast_summary[0].imdb_id)
print ('Length of the Cast List:',len(title.cast_summary))
For our recommendation engine, we will incorporate the imdb_rating, certification (PG, PG-13, R, etc.), and the top 4 actors in the movie as features.
The commented code below was used to download this information for each movie. Since there are 10,329 movies, we split the downloads among ourselves, as it would have taken more than 4.5 hours to download all movie info on one computer.
While downloading the data, we encountered several errors when a movie did not exist or actor information was unavailable; for these reasons, the code below was adjusted as we continued the download process.
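One way to make such a download loop robust is to wrap each lookup so that a failed or incomplete title falls back to NaN placeholders. In this sketch, `fetch` stands in for `imdb.get_title_by_id` and the helper name is hypothetical:

```python
import numpy as np

def safe_movie_fields(fetch, ttid, n_actors=4):
    """Look up one title and return a dict of imdb_rating, cert, and the
    top-n actors, substituting NaN for anything missing or failing."""
    fields = {'imdb_rating': np.nan, 'cert': np.nan}
    actors = [np.nan] * n_actors
    try:
        title = fetch(ttid)
        fields['imdb_rating'] = title.rating
        fields['cert'] = title.certification
        for j, member in enumerate(title.cast_summary[:n_actors]):
            actors[j] = member.name
    except Exception:
        pass  # leave NaN placeholders for titles that fail to download
    for j, name in enumerate(actors):
        fields['Actor_%d' % j] = name
    return fields
```

With something like this, hard-coding the indices of known-bad titles (as we did below) would not be necessary.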
# Used only for preprocessing, commented out for final runs
# df_links['imdb_rating'] = np.nan
# df_links['cert'] = np.nan
# df_links['Actor_0'] = np.nan
# df_links['Actor_1'] = np.nan
# df_links['Actor_2'] = np.nan
# df_links['Actor_3'] = np.nan
# Download ranges
# Daniel [0,3300]
# Josh [3300, 6600]
# Darrel [6600, 10329]
# Used only for preprocessing, commented out for final runs
# for i in range(0,10329):
#     movie = df_links['imdbId'][i]
#     title = imdb.get_title_by_id("tt" + str(movie).zfill(7))
#     print (i, movie, title)
#     # Indices of titles that could not be downloaded
#     if i == 8030 or i == 8659 or i == 9753 or i == 10047 or i == 10328:
#         df_links['imdb_rating'].iloc[i] = np.nan
#         df_links['cert'].iloc[i] = np.nan
#         df_links['Actor_0'].iloc[i] = np.nan
#         df_links['Actor_1'].iloc[i] = np.nan
#         df_links['Actor_2'].iloc[i] = np.nan
#         df_links['Actor_3'].iloc[i] = np.nan
#     else:
#         df_links['imdb_rating'].iloc[i] = title.rating
#         df_links['cert'].iloc[i] = title.certification
#         for j in range(0, len(title.cast_summary)):
#             df_links['Actor_'+str(j)].iloc[i] = title.cast_summary[j].name
# Used only for preprocessing, commented out for final runs
# df_links.to_csv('df_links_imdb_0_3300.csv',encoding = 'utf-8')
The code below concatenates our 3 download files from IMDb into 1.
# Daniel: df_links_imdb_0_3300.csv
df_imdb_1 = pd.read_csv('df_links_imdb_0_3300.csv')
print ('Set 1 Before:', df_imdb_1.shape)
df_imdb_1.dropna(thresh=6, inplace=True)
print ('Set 1 After:', df_imdb_1.shape)
# Josh: df_links_imdb_3300_6600.csv
df_imdb_2 = pd.read_csv('df_links_imdb_3300_6600.csv')
print ('Set 2 Before:', df_imdb_2.shape)
df_imdb_2.dropna(thresh=6, inplace=True)
print ('Set 2 After:', df_imdb_2.shape)
# Darrel: df_links_imdb_6600_End.csv
df_imdb_3 = pd.read_csv('df_links_imdb_6600_End.csv')
print ('Set 3 Before:', df_imdb_3.shape)
df_imdb_3.dropna(thresh=6, inplace=True)
print ('Set 3 After:', df_imdb_3.shape)
#Concatenating all 3 files into 1
df_imdb = df_imdb_1.append([df_imdb_2, df_imdb_3])
# Delete unused dataframes to reduce memory usage and avoid confusion
del df_imdb_1
del df_imdb_2
del df_imdb_3
#Since we used the df_links dataframe to query from the API, we no longer need the df_links information
#We will keep the movieID (key), and data specifically from IMDb
df_imdb = df_imdb[['movieId','imdb_rating','cert','Actor_0','Actor_1','Actor_2','Actor_3']]
print (df_imdb.shape)
df_imdb.head()
Below, we confirm that no duplicate movieId values are present in df_imdb.
# review results to ensure no duplicate movieId values exist
# no rows will be returned if there are no duplicate values
groups = df_imdb.groupby(by=['movieId'])
groups.filter(lambda x: len(x) > 1).sort_values(by='movieId')
First, we perform another left join between our dataframe containing ratings, links, and movies and our new IMDb data.
df_Ratings_Link_Movies_imdb_join = pd.merge(df_Ratings_Link_Movies_join, df_imdb, how='left',on='movieId')
print (df_Ratings_Link_Movies_imdb_join.shape)
df_Ratings_Link_Movies_imdb_join.head()
We will now rename our dataframe and remove the timestamp, imdbId, and tmdbId columns because our recommendations will not be based on time or IDs.
df = df_Ratings_Link_Movies_imdb_join
df = df[['userId','movieId','rating','title','genres','imdb_rating','cert','Actor_0','Actor_1','Actor_2','Actor_3']]
df.head()
The information below simply gives us an idea of how the data is distributed in the dataset.
df.info()
df.rating.hist(bins = 60)
plt.title('MovieLens Ratings Histogram')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.xlim([0,5])
plt.show()
df.imdb_rating.hist(bins = 60)
plt.title('IMDb Ratings Histogram')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.xlim([0,10])
plt.show()
df.groupby('cert').count().sort_values('movieId', ascending=False).head(10)
df.groupby('userId').count().sort_values('movieId', ascending=False).head(10)
The code below is commented out because it sometimes causes the kernel to freeze.
#df.groupby('genres').count().sort_values('movieId', ascending=False).head(10)
#df.groupby('Actor_0').count().sort_values('movieId', ascending=False).head(10)
#df.groupby('Actor_1').count().sort_values('movieId', ascending=False).head(10)
#df.groupby('Actor_2').count().sort_values('movieId', ascending=False).head(10)
#df.groupby('Actor_3').count().sort_values('movieId', ascending=False).head(10)
From the GraphLab documentation: The user id and item id columns must be of type ‘int’ or ‘str’. The target column must be of type ‘int’ or ‘float’.
This is verified below. Also, once we drop rows containing null values, GraphLab no longer throws an error.
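These dtype requirements can be enforced defensively before handing the frame to GraphLab; a small sketch with pandas, using our column names:

```python
import pandas as pd

def enforce_recommender_dtypes(frame):
    """Cast the id columns to str and the rating target to float,
    matching GraphLab's documented requirements."""
    frame = frame.copy()
    frame['userId'] = frame['userId'].astype(str)
    frame['movieId'] = frame['movieId'].astype(str)
    frame['rating'] = frame['rating'].astype(float)
    return frame

sample = pd.DataFrame({'userId': [1, 2], 'movieId': [10, 20], 'rating': [4, 5]})
checked = enforce_recommender_dtypes(sample)
```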
df = df.dropna()
df.info()
item similarity models: item_similarity_recommender
item content recommenders: item_content_recommender
factorization recommenders: factorization_recommender
A Factorization-based recommender that learns latent factors for each user and item and uses them to make rating predictions. This includes both standard matrix factorization as well as factorization machines models (in the situation where side data is available for users and/or items).
Supports side_data_factorization: Use factorization for modeling any additional features beyond the user and item columns. If True, and side features or any additional columns are present, then a Factorization Machine model is trained. Otherwise, only the linear terms are fit to these features. Default: True.
factorization recommenders for ranking: ranking_factorization_recommender
popularity-based recommenders: popularity_recommender
We wish to evaluate the several types of recommenders listed above for our movie recommendation model.
The performance of each model will be evaluated via a precision-recall curve; the model that scores highest on both measures is preferred.
We define the following variables below:
import graphlab as gl
gl.canvas.set_target('ipynb')
#Creating an SFrame with our movie data, df
data = gl.SFrame(data=df)
#Defining the side information for our movies/items
item_data = data[['title','genres','imdb_rating','cert','Actor_0','Actor_1','Actor_2','Actor_3']]
# Split the data into a single training and test set
train, test = gl.recommender.util.random_split_by_user(data,
user_id="userId",
item_id="title",
max_num_users=None, #None: use all available users for test set
item_test_proportion=0.2) #80/20 train/test split
The graphlab.recommender.create is a unified interface for training recommender models. Based on simple characteristics of the data, a type of model is selected and trained. The trained model can be used to predict ratings and make recommendations.
First, we will build two models based on the standard recommender by GraphLab:
The recommender.create selects a ranking factorization model.
#Train a model based on characteristics of the data, a ranking_factorization_recommender is selected
user_item_rec_create_noside = gl.recommender.create(train,
user_id="userId",
item_id="title",
target="rating")
The recommender.create selects a ranking factorization model.
#Train a model based on characteristics of the data, a ranking_factorization_recommender is selected by default
user_item_rec_create_side = gl.recommender.create(train,
user_id="userId",
item_id="title",
target="rating",
item_data=item_data) #side data included
Since the ranking factorization method was chosen above, we will also build two models, with and without side data, using a plain factorization recommender.
user_item_factor_noside = gl.factorization_recommender.create(train,
user_id="userId",
item_id="title",
target="rating")
#Train a model based on characteristics of the data, a ranking_factorization_recommender is selected by default
user_item_factor_side = gl.factorization_recommender.create(train,
user_id="userId",
item_id="title",
target="rating",
item_data=item_data) #side data included
Next, we will build an item similarity model. Currently, this type of model does not support side data.
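For intuition, the cosine similarity this model uses can be illustrated on toy item rating vectors. This is a simplified sketch; GraphLab's implementation additionally handles sparsity and unrated entries:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two item rating vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Ratings by the same 4 users for two movies (0 = unrated)
movie_x = [5.0, 4.0, 0.0, 1.0]
movie_y = [4.0, 5.0, 0.0, 2.0]
cosine_sim(movie_x, movie_y)  # close to 1: the two movies are rated similarly
```

Items with high pairwise similarity are then surfaced as recommendations for users who rated one of them highly.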
item_item_noside = gl.recommender.item_similarity_recommender.create(train,
user_id="userId",
item_id="title",
target="rating",
similarity_type="cosine")
popularity_noside = gl.recommender.popularity_recommender.create(train,
user_id="userId",
item_id="title",
target="rating")
Although side item_data is included, the warning below indicates that these variables are ignored; data preparation nevertheless took more than twice as long.
popularity_side = gl.recommender.popularity_recommender.create(train,
user_id="userId",
item_id="title",
target="rating",
item_data=item_data) #side-data included
From the precision-recall plot below, the two popularity models score poorly, as do the two factorization-only models.
Of the 7 models tested, 3 are viable. The item-item model performed best, followed by the recommender.create model with side data.
first_models = [user_item_rec_create_noside,
user_item_rec_create_side,
user_item_factor_noside,
user_item_factor_side,
item_item_noside,
popularity_noside,
popularity_side]
comparisonstruct = gl.compare(test,first_models)
gl.show_comparison(comparisonstruct,first_models)
In the following section, we will try several parameter combinations for each of the model types above to see if the models maintain their performance rankings in general.
Five models for each model type will be sampled via GraphLab's random grid search method; we used random search due to computing constraints.
# Define model parameters
params = {'user_id': 'userId',
'item_id': 'title',
'target': 'rating',
'item_data':[item_data,None],
'num_factors': [6, 12, 24],
'regularization':[1e-12,1e-8,1e-4,1],
'linear_regularization': [1e-12,1e-8,1e-4,1]}
fac_rec_gs = gl.model_parameter_search.random_search.create((train,test),
gl.recommender.factorization_recommender.create,
params,
max_models=5,
environment=None)
fac_rec_gs.get_results()
# Define model parameters
params = {'user_id': 'userId',
'item_id': 'title',
'target': 'rating',
'item_data':[item_data,None],
'num_factors': [6, 12, 24],
'regularization':[1e-12,1e-8,1e-4,1],
'linear_regularization': [1e-12,1e-8,1e-4,1],
'ranking_regularization':[0, 0.1, 0.5, 1]}
ranking_fac_rec_gs = gl.model_parameter_search.random_search.create((train,test),
gl.recommender.ranking_factorization_recommender.create,
params,
max_models=5,
environment=None)
ranking_fac_rec_gs.get_results()
# Define model parameters
params = {'user_id': 'userId',
'item_id': 'title',
'target': 'rating',
'similarity_type':['jaccard','cosine','pearson'],
'only_top_k':[5,10,25,64]}
# only_top_k: number of similar items to store for each item (default 64).
# Decreasing this reduces the memory required for the model,
# but may also decrease accuracy.
item_item_gs = gl.model_parameter_search.random_search.create((train,test),
gl.recommender.item_similarity_recommender.create,
params,
max_models=5,
environment=None)
#all models are placed into the list below
model_List = []
#factorization_recommender
fac_rec = fac_rec_gs.get_models()
#ranking_factorization_recommender
ranking_fac_rec = ranking_fac_rec_gs.get_models()
#item_similarity recommender
item_item_models = item_item_gs.get_models()
#first set of models in list are model_0-4
model_List = [model for model in fac_rec]
#second set of models in list are model_5-9
model_List = model_List + [model for model in ranking_fac_rec]
#third set of models in list are model_10-14
model_List = model_List + [model for model in item_item_models]
comparison_struct = gl.compare(test, model_List)
gl.show_comparison(comparison_struct, model_List)
As expected, the item-item recommender model outperforms all other models consistently.
The output below shows similar items from the item_similarity model; the results are not unexpected and seem reasonable.
results = item_item_noside.get_similar_items(k=5)
results
results[results['title'] == 'Pulp Fiction (1994)']
#Original Dataframe
print (df.shape)
df.head()
First, we wish to collect all actors into a single column alongside the movies they appear in.
data_1=data['genres', 'movieId', 'rating', 'Actor_0']
data_1.rename({'Actor_0': 'Actor'})
data_2=data['genres', 'movieId', 'rating', 'Actor_1']
data_2.rename({'Actor_1': 'Actor'})
data_3=data['genres', 'movieId', 'rating', 'Actor_2']
data_3.rename({'Actor_2': 'Actor'})
data_4=data['genres', 'movieId', 'rating', 'Actor_3']
data_4.rename({'Actor_3': 'Actor'})
actor_genres = data_1.append(data_2).append(data_3).append(data_4)
print (actor_genres.shape)
actor_genres.head()
The code below takes this a step further, splitting the genres field and associating each individual genre with the actors.
df_genre = actor_genres.to_dataframe()
s = df_genre['genres'].str.split('|').apply(pd.Series, 1).stack()
s.index = s.index.droplevel(-1) # to line up with df's index
s.name = 'GenreSplit'
actor_genre_split = gl.SFrame(data=pd.concat([df_genre, s.to_frame()], axis=1, join='inner'))
actor_genre_split.head()
# Split the data into a single training and test set
train, test = gl.recommender.util.random_split_by_user(actor_genres,
user_id="genres",
item_id="Actor",
max_num_users=None, #None: use all available users for test set
item_test_proportion=0.2) #80/20 train/test split
# Define model parameters
params = {'user_id': 'genres',
'item_id': 'Actor',
'target': 'rating',
'similarity_type':['jaccard','cosine','pearson'],
'only_top_k':[5,10,25,64]}
item_item_actor_gs = gl.model_parameter_search.random_search.create((train,test),
gl.recommender.item_similarity_recommender.create,
params,
max_models=5,
environment=None)
item_item_actor_gs.get_results()
# Define model parameters
params = {'user_id': 'genres',
'item_id': 'Actor',
'target': 'rating',
'num_factors': [6, 12, 24],
'regularization':[1e-12,1e-8,1e-4,1],
'linear_regularization': [1e-12,1e-8,1e-4,1],
'ranking_regularization':[0, 0.1, 0.5, 1]}
ranking_fac_rec_actor_gs = gl.model_parameter_search.random_search.create((train,test),
gl.recommender.ranking_factorization_recommender.create,
params,
max_models=5,
environment=None)
ranking_fac_rec_actor_gs.get_results()
item_item_actor_genre = item_item_actor_gs.get_models()
ranking_fac_rec_actor_genre = ranking_fac_rec_actor_gs.get_models()
actor_genre_model_List = []
#first set of models in list are model_0-4
actor_genre_model_List = [model for model in item_item_actor_genre]
#second set of models in list are model_5-9
actor_genre_model_List = actor_genre_model_List + [model for model in ranking_fac_rec_actor_genre]
comparison_Actor = gl.compare(test, actor_genre_model_List)
gl.show_comparison(comparison_Actor, actor_genre_model_List)
From the above precision-recall plot, we can see that model_1, which belongs to the item_similarity recommenders, performs better than the other models. The models near model_1 are also all item_similarity models. (Models 0-4 in the list are item_similarity models; models 5-9 are ranking_factorization models.)
actor_genre_model_List[1]
actor_genre_rec = gl.recommender.item_similarity_recommender.create(actor_genres,
user_id="genres",
item_id="Actor",
target="rating",
similarity_type='jaccard')
results = actor_genre_rec.get_similar_items(k=2)
results
Although the precision-recall scores are not high, some of the results here make sense: Bruce Willis is very similar to Jason Statham. This model does warrant further improvement, however.
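For intuition, the Jaccard similarity used here compares the sets of contexts (in this model, genre strings) in which two actors were rated; a toy sketch with hypothetical genre sets:

```python
def jaccard_sim(a, b):
    """Jaccard similarity of two sets: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    return len(a & b) / float(len(a | b)) if (a or b) else 0.0

# Hypothetical genre sets for two action-movie actors
actor_1 = {'Action', 'Crime', 'Thriller'}
actor_2 = {'Action', 'Thriller', 'Comedy'}
jaccard_sim(actor_1, actor_2)  # 2 shared of 4 total = 0.5
```

Unlike cosine similarity, Jaccard ignores the rating values themselves and considers only co-occurrence, which suits this actor-genre formulation.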
actor_genre_split.head()
# Split the data into a single training and test set
train, test = gl.recommender.util.random_split_by_user(actor_genre_split,
user_id="GenreSplit",
item_id="Actor",
max_num_users=None, #None: use all available users for test set
item_test_proportion=0.2) #80/20 train/test split
# Define model parameters
params = {'user_id': 'GenreSplit',
'item_id': 'Actor',
'target': 'rating',
'similarity_type':['jaccard','cosine','pearson'],
'only_top_k':[5,10,25,64]}
item_item_actor_genre_split_gs = gl.model_parameter_search.random_search.create((train,test),
gl.recommender.item_similarity_recommender.create,
params,
max_models=5,
environment=None)
# Define model parameters
params = {'user_id': 'GenreSplit',
'item_id': 'Actor',
'target': 'rating',
'num_factors': [6, 12, 24],
'regularization':[1e-12,1e-8,1e-4,1],
'linear_regularization': [1e-12,1e-8,1e-4,1],
'ranking_regularization':[0, 0.1, 0.5, 1]}
ranking_fac_rec_actor_genre_split_gs = gl.model_parameter_search.random_search.create((train,test),
gl.recommender.ranking_factorization_recommender.create,
params,
max_models=5,
environment=None)
item_item_actor_genre_split = item_item_actor_genre_split_gs.get_models()
ranking_fac_rec_actor_genre_split = ranking_fac_rec_actor_genre_split_gs.get_models()
actor_genre_split_model_List = []
#first set of models in list are model_0-4
actor_genre_split_model_List = [model for model in item_item_actor_genre_split]
#second set of models in list are model_5-9
actor_genre_split_model_List = actor_genre_split_model_List + [model for model in ranking_fac_rec_actor_genre_split]
comparison_Actor = gl.compare(test, actor_genre_split_model_List)
gl.show_comparison(comparison_Actor, actor_genre_split_model_List)
The recall values here are so small that they display as 0 throughout. Precision reaches a maximum of 0.1, achieved by an item_similarity model.
We fit an item_similarity model below to inspect the results.
actor_genre_split_rec = gl.recommender.item_similarity_recommender.create(actor_genre_split,
user_id="GenreSplit",
item_id="Actor",
target="rating",
similarity_type='jaccard')
results = actor_genre_split_rec.get_similar_items(k=2)
results
These results do not appear to be as good as those from the models where the genres are not split.